[SPARK-3713][SQL] Uses JSON to serialize DataType objects #2563
liancheng wants to merge 8 commits into apache:master from
Conversation
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have started for PR 2563 at commit
Tests timed out after a configured wait of
Test FAILed.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
I think this comment is in the wrong place. We should probably note that this parser is deprecated and is only here for backwards compatibility. We might even print a warning when it is used so we can get rid of it eventually.
Ah, this comment was a mistake. Instead of printing a warning, I made fromCaseClassString() private. It's only referenced by CaseClassStringParser, which has already been marked as deprecated.
Minor comment, otherwise this LGTM.
python/pyspark/sql.py (outdated)
You can have a default implementation as:
self.__class__.__name__[:-4].lower()
Thanks for this, it saved lots of boilerplate code! Removed all simpleString() methods in the subclasses.
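The suggestion above can be sketched as a minimal, self-contained example (the subclass names here are illustrative stand-ins for PySpark's type classes, not the actual implementation): a single default `simpleString` on the base class derives the name by stripping the trailing "Type" suffix, so subclasses need no override.

```python
# Minimal sketch of the suggested default implementation.
# Subclass names are illustrative, not PySpark's real class hierarchy.
class DataType(object):
    def simpleString(self):
        # IntegerType -> "integer", StringType -> "string", etc.
        return self.__class__.__name__[:-4].lower()

class IntegerType(DataType):
    pass

class StringType(DataType):
    pass

print(IntegerType().simpleString())  # -> integer
print(StringType().simpleString())   # -> string
```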
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
python/pyspark/sql.py (outdated)
Why not just put _get_simple_string here? (It's not needed as a separate function; it's harder to understand without this context.)
To make it available on the class, it could be a classmethod:

```python
@classmethod
def simpleString(cls):
    return cls.__name__[:-4].lower()
```
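A quick sketch of why the classmethod form is attractive (subclass name illustrative): the simple string becomes available on the class itself, with no instance required, while instance calls still work.

```python
# Sketch of the classmethod variant; BooleanType is an illustrative
# stand-in, not the actual PySpark class.
class DataType(object):
    @classmethod
    def simpleString(cls):
        # Strip the trailing "Type" suffix: BooleanType -> "boolean".
        return cls.__name__[:-4].lower()

class BooleanType(DataType):
    pass

# Works on the class itself as well as on instances.
print(BooleanType.simpleString())    # -> boolean
print(BooleanType().simpleString())  # -> boolean
```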
@davies Thanks for all the suggestions, they really make things a lot cleaner!
QA tests have started for PR 2563 at commit
Force-pushed from 54c46ce to 785b683.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
Tests timed out after a configured wait of
python/pyspark/sql.py (outdated)
If you'd like to use a single string for primitive types, that's still doable; just use a one-layer dict for the others.
Either one is good to me.
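The hybrid encoding discussed above can be sketched as follows. This is a simplified, self-contained illustration (class names and the jsonValue method are stand-ins for the PR's actual code): primitive types serialize to a bare JSON string, while complex types use one layer of dict.

```python
import json

# Sketch of the encoding under discussion: primitives as bare strings,
# complex types as a flat dict. Names are simplified stand-ins.
class DataType(object):
    def jsonValue(self):
        # Primitive default: just the simple string, e.g. "integer".
        return self.__class__.__name__[:-4].lower()

class IntegerType(DataType):
    pass

class StringType(DataType):
    pass

class ArrayType(DataType):
    def __init__(self, elementType, containsNull=True):
        self.elementType = elementType
        self.containsNull = containsNull

    def jsonValue(self):
        # Complex type: one layer of dict wrapping the element type.
        return {"type": "array",
                "elementType": self.elementType.jsonValue(),
                "containsNull": self.containsNull}

print(json.dumps(IntegerType().jsonValue()))            # -> "integer"
print(json.dumps(ArrayType(StringType()).jsonValue()))
```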
This looks good to me; you just forgot to roll back the changes in run-tests after debugging.
@davies Sorry for my carelessness... And thanks again for all the great advice!
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
LGTM now, thanks!
Could you rebase this to master?
Force-pushed from de18dea to fc92eb3.
Finished rebasing.
QA tests have started for PR 2563 at commit
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have finished for PR 2563 at commit
@marmbrus I think this is ready to go.
Thanks! I've merged this.
This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases.

Since we already write schema information to Parquet metadata in the old style, we have to preserve the old `DataType` parser and ensure backward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`.

@JoshRosen @davies Please help review the PySpark-related changes, thanks!
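The core benefit of the switch can be sketched in a few lines: with JSON, a stock parser recovers the schema structure losslessly, whereas the old case-class `toString` output required a hand-written parser. The schema layout below is illustrative only, not Spark's exact JSON format.

```python
import json

# Illustrative round trip of a JSON-encoded schema. With JSON, parsing
# is handled by a standard library parser; the old toString-based format
# needed a hand-written, fragile parser (CaseClassStringParser).
def to_json(schema):
    """Serialize a schema (nested dicts/strings) to a JSON string."""
    return json.dumps(schema)

def from_json(s):
    """Parse the JSON string back into the same nested structure."""
    return json.loads(s)

schema = {"type": "struct",
          "fields": [{"name": "id", "type": "integer", "nullable": False},
                     {"name": "name", "type": "string", "nullable": True}]}

assert from_json(to_json(schema)) == schema  # lossless round trip
print(to_json(schema))
```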